Comparing Data Utilization Paradigms: The Annotation Spectrum
EvoClass-AI003 Lecture 10

Comparing Data Utilization Paradigms: The Annotation Spectrum

The successful deployment of machine learning models hinges on the availability, quality, and cost of labeled data. In settings where human annotation is expensive, infeasible, or highly specialized, the traditional paradigm becomes inefficient or fails entirely. We introduce the annotation spectrum, which distinguishes three core approaches by how they utilize information: Supervised Learning (SL), Unsupervised Learning (UL), and Semi-Supervised Learning (SSL).

1. Supervised Learning (SL): High Fidelity, High Cost

Supervised learning operates on datasets in which every input $X$ is paired with a known ground-truth label $Y$. While this approach typically achieves the highest predictive accuracy for classification and regression tasks, its dependence on dense, high-quality annotations makes it resource-intensive. Performance degrades sharply when labeled examples are scarce, leaving the paradigm brittle and often economically prohibitive for large-scale, continuously evolving datasets.
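The supervised setting can be sketched in a few lines: every input $X$ carries a known label $Y$, and prediction uses only those labeled pairs. The data values and the 1-nearest-neighbor rule below are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch of supervised learning: a labeled dataset of (X, Y) pairs
# and a predictor that relies entirely on the ground-truth labels.
# The specific data points and the 1-NN rule are illustrative assumptions.

def nearest_neighbor_predict(labeled_data, x):
    """Predict the label of x from the closest labeled example."""
    closest = min(labeled_data, key=lambda pair: abs(pair[0] - x))
    return closest[1]

# Labeled dataset D_L = {(X_i, Y_i)}: each input has a known label.
D_labeled = [(0.1, "A"), (0.3, "A"), (0.8, "B"), (0.9, "B")]

print(nearest_neighbor_predict(D_labeled, 0.2))  # nearest labeled point is 0.1 -> "A"
print(nearest_neighbor_predict(D_labeled, 0.7))  # nearest labeled point is 0.8 -> "B"
```

Note how the quality of every prediction depends directly on the labeled pairs: remove or corrupt them and the method has nothing to fall back on, which is exactly the brittleness described above.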

2. Unsupervised Learning (UL): Latent Structure Discovery

Unsupervised learning operates solely on unlabeled data $D = \{X_1, X_2, ..., X_n\}$. Its goal is to infer the intrinsic structure, underlying probability distribution, density, or meaningful representations of the data manifold. Primary applications include clustering, manifold learning, and representation learning. UL is highly effective for preprocessing and feature engineering, yielding valuable insight without any external human intervention.

Question 1
Which learning paradigm is designed specifically to mitigate high reliance on expensive human data annotation by utilizing abundant unlabeled data?
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning
Question 2
If a model's primary task is dimensionality reduction (e.g., finding the principal components) or clustering, which paradigm is universally employed?
Supervised Learning
Semi-Supervised Learning
Unsupervised Learning
Transfer Learning
Challenge: Defining the SSL Objective
Conceptualizing the Combined Loss Function
Unlike SL, which optimizes solely based on labeled fidelity, SSL requires a balanced optimization strategy. The total loss must capture prediction accuracy on the labeled set while enforcing consistency (e.g., smoothness or low density separation) across the unlabeled set.

Given: $D_L$: Labeled Data. $D_U$: Unlabeled Data. $\mathcal{L}_{SL}$: Supervised Loss function. $\mathcal{L}_{Consistency}$: Loss enforcing prediction smoothness on $D_U$.
Step 1
Write the general form of the total optimization objective $\mathcal{L}_{SSL}$, incorporating a weighting coefficient $\lambda$ for the unlabeled consistency component.
Solution:
The conceptual form of the total SSL loss is a weighted sum of the two components: $\mathcal{L}_{SSL} = \mathcal{L}_{SL}(D_L) + \lambda \cdot \mathcal{L}_{Consistency}(D_U)$. The scalar $\lambda$ controls the trade-off between label fidelity and structure reliance.
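The weighted-sum objective above can be made concrete with a toy model. Everything below is an illustrative assumption: the linear model $f$, the squared-error supervised loss on $D_L$, and a perturbation-based consistency loss on $D_U$ that penalizes prediction changes under a small input shift.

```python
# Sketch of the combined SSL objective:
#   L_SSL = L_SL(D_L) + lambda * L_Consistency(D_U)
# The model f, both loss choices, and all data values are illustrative
# assumptions used only to show how the two terms are combined.

def f(x, w=2.0):
    """Toy linear model; w is a fixed illustrative parameter."""
    return w * x

def supervised_loss(labeled_data):
    """Label fidelity: mean squared error on the labeled set D_L."""
    return sum((f(x) - y) ** 2 for x, y in labeled_data) / len(labeled_data)

def consistency_loss(unlabeled_data, eps=0.1):
    """Smoothness on D_U: penalize prediction change under a small
    input perturbation eps (no labels required)."""
    return sum((f(x) - f(x + eps)) ** 2 for x in unlabeled_data) / len(unlabeled_data)

def ssl_loss(D_L, D_U, lam=0.5):
    """Weighted sum: supervised term plus lambda-scaled consistency term."""
    return supervised_loss(D_L) + lam * consistency_loss(D_U)

D_L = [(1.0, 2.0), (2.0, 4.5)]  # labeled pairs (X, Y)
D_U = [0.5, 1.5, 3.0]           # unlabeled inputs
print(ssl_loss(D_L, D_U, lam=0.5))
```

Raising $\lambda$ shifts the optimization toward trusting the structure of the unlabeled data; setting $\lambda = 0$ recovers plain SL, which is the trade-off the scalar $\lambda$ controls.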